EDA and predictive modelling of Airbnb prices¶

What's in this notebook:¶

Data prep¶

  • add city field
  • add weekend/weekday flag
  • combine all files
  • EDA

    • summary statistics for each variable
    • distribution of prices
    • relationship between prices and other variables
  • Regression modelling to estimate how different attributes affect listing price

  • Does being close to the cluster centre affect prices?

  • Is there a superhost premium? (Do superhosts charge more?)

    • propensity score matching between superhosts and non-superhosts

About the data¶

I used a dataset containing Airbnb listings in 10 European cities, with the following attributes available for each listing:

  • realSum: the full price of accommodation for two people and two nights in EUR
  • room_type: the type of the accommodation
  • room_shared: dummy variable for shared rooms
  • room_private: dummy variable for private rooms
  • person_capacity: the maximum number of guests
  • host_is_superhost: dummy variable for superhost status
  • multi: dummy variable if the listing belongs to hosts with 2-4 offers
  • biz: dummy variable if the listing belongs to hosts with more than 4 offers
  • cleanliness_rating: cleanliness rating
  • guest_satisfaction_overall: overall rating of the listing
  • bedrooms: number of bedrooms (0 for studios)
  • dist: distance from city centre in km
  • metro_dist: distance from nearest metro station in km
  • attr_index: attraction index of the listing location
  • attr_index_norm: normalised attraction index (0-100)
  • rest_index: restaurant index of the listing location
  • rest_index_norm: normalised restaurant index (0-100)
  • lng & lat: longitude and latitude of the listing location

Data source: Kaggle https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.api as sm
from scipy.stats import pearsonr

Data preparation¶

Import data¶

In [3]:
cities_list = ['amsterdam', 'athens', 'barcelona', 'berlin', 'budapest', 
            'lisbon', 'london', 'paris', 'rome', 'vienna']
In [4]:
weekdays_dict = {}
weekends_dict = {}

for city in cities_list:
    df_weekdays = pd.read_csv(f'data/{city}_weekdays.csv')
    df_weekends = pd.read_csv(f'data/{city}_weekends.csv')
    
    # city variable
    df_weekdays['city'] = city
    df_weekends['city'] = city
    
    # dummy variable for weekend
    df_weekdays['is_weekend'] = 0
    df_weekends['is_weekend'] = 1
    
    df_weekdays['week_part'] = 'weekday'
    df_weekends['week_part'] = 'weekend'
    
    weekdays_dict[city] = df_weekdays
    weekends_dict[city] = df_weekends
    print(city, f'weekdays: {len(df_weekdays)}, weekends: {len(df_weekends)}')  
In [5]:
#  combine into one dataframe
df = pd.concat(list(weekdays_dict.values())+list(weekends_dict.values())).drop(columns=['Unnamed: 0'])
In [6]:
# convert realSum to price per night
df['price_per_night'] = df['realSum']/2
In [7]:
df.info()

Exploratory data analysis¶

  • Athens has the best guest satisfaction and cleanliness ratings, and is also the cheapest city despite having the highest average person capacity.
  • Amsterdam is the most expensive, perhaps partly because it has the highest average number of bedrooms.
  • London has the best attraction index but the worst guest satisfaction and cleanliness ratings.
  • Listing prices on weekdays and weekends are similar, except in Amsterdam, where prices are higher on weekends.
  • Listing prices for superhosts and non-superhosts do not differ much in most cities.

averages¶

In [44]:
df_ = df.groupby('city').agg({'price_per_night':'median', 'room_shared':'mean', 'bedrooms':'mean', 'person_capacity':'mean', 
                              'host_is_superhost':'mean', 'multi':'mean', 'biz':'mean', 
                              'guest_satisfaction_overall':'mean', 'cleanliness_rating':'mean', 
                              'dist':'mean', 'metro_dist':'mean', 
                             'attr_index_norm':'mean', 'rest_index_norm':'mean'}).reset_index().sort_values('price_per_night')
df_
Out[44]:
city price_per_night room_shared bedrooms person_capacity host_is_superhost multi biz guest_satisfaction_overall cleanliness_rating dist metro_dist attr_index_norm rest_index_norm
1 athens 63.857709 0.002083 1.271402 3.698106 0.428598 0.267424 0.379924 95.003598 9.638447 1.803080 0.478656 5.740839 9.954268
4 budapest 76.491047 0.003481 1.105669 3.540776 0.378916 0.303332 0.348831 94.585281 9.477374 1.872763 0.544059 12.675248 34.529089
8 rome 91.295911 0.001329 1.229755 3.357372 0.326687 0.385953 0.334109 93.122300 9.514678 3.026982 0.819794 10.426968 25.078056
3 berlin 95.587548 0.029388 1.070451 2.774960 0.257246 0.276167 0.174718 94.323671 9.461755 5.257093 0.836064 16.803111 30.666967
2 barcelona 104.149696 0.004236 1.161313 2.616661 0.181433 0.385104 0.325450 91.109072 9.291564 2.116982 0.441248 16.636220 19.376528
9 vienna 104.247014 0.004524 1.102347 3.350297 0.284139 0.279050 0.339836 93.731128 9.472434 3.139488 0.526670 8.762474 4.239580
5 lisbon 112.687617 0.012841 1.272428 3.343398 0.213951 0.239459 0.587541 91.093875 9.370640 1.966893 0.711482 7.324730 28.274084
6 london 130.647475 0.005004 1.128790 2.846192 0.157410 0.274992 0.387872 90.645652 9.175023 5.326421 1.005547 20.537398 11.234105
7 paris 158.798583 0.014055 0.972787 2.953648 0.140700 0.219498 0.245813 92.037530 9.263606 2.995823 0.227323 18.204358 42.589111
0 amsterdam 230.122091 0.004808 1.292308 2.781731 0.284135 0.283173 0.105288 94.514423 9.465865 2.825052 1.089367 14.246499 26.097566

standard deviation¶

In [45]:
df_ = df.groupby('city').agg({'price_per_night':'std', 'room_shared':'std', 
                              'bedrooms':'std', 'person_capacity':'std', 
                              'host_is_superhost':'std',  'multi':'std', 'biz':'std',
                              'guest_satisfaction_overall':'std', 'cleanliness_rating':'std', 
                              'dist':'std', 'metro_dist':'std',
                              'attr_index_norm':'std', 'rest_index_norm':'std'}).reset_index().sort_values('price_per_night')
df_
Out[45]:
city price_per_night room_shared bedrooms person_capacity host_is_superhost multi biz guest_satisfaction_overall cleanliness_rating dist metro_dist attr_index_norm rest_index_norm
5 lisbon 54.486540 0.112596 0.728539 1.344214 0.410128 0.426790 0.492320 9.148114 0.924080 1.742681 0.920204 5.082390 17.877309
8 rome 59.309052 0.036438 0.549710 1.309052 0.469028 0.486847 0.471704 7.815107 0.808415 1.644095 0.631361 6.631054 13.414188
4 budapest 65.572403 0.058903 0.663484 1.256548 0.485177 0.459754 0.476660 6.525680 0.842693 1.874925 0.856410 6.672535 19.111435
3 berlin 117.664645 0.168926 0.552033 1.188142 0.437204 0.447191 0.379802 6.809406 0.849384 3.692649 1.267283 10.774273 16.634505
1 athens 132.940027 0.045600 0.652575 1.284703 0.494922 0.442657 0.485414 8.348637 0.839767 0.953738 0.284154 4.667181 10.778060
7 paris 165.474872 0.117727 0.642571 1.215007 0.347738 0.413937 0.430601 8.818201 0.974036 1.463542 0.122769 7.759372 15.680438
2 barcelona 177.733944 0.064956 0.517108 1.153124 0.385445 0.486706 0.468625 8.607153 1.014577 1.377859 0.284540 9.591798 10.256285
9 vienna 198.873582 0.067115 0.602819 1.282163 0.451067 0.448596 0.473720 7.220808 0.855439 1.942337 0.516132 6.259531 3.683256
0 amsterdam 215.329203 0.069187 0.736683 1.032634 0.451110 0.450648 0.306999 6.350874 0.813421 2.082573 0.831669 10.335158 17.720931
6 london 235.678632 0.070562 0.579477 1.246235 0.364205 0.446533 0.487289 11.510622 1.166180 2.712573 1.263926 11.914575 6.963803
In [46]:
metric_list = ['price_per_night','room_shared','bedrooms','person_capacity',
               'host_is_superhost','multi', 'biz', 'guest_satisfaction_overall','cleanliness_rating',
               'dist', 'metro_dist',  'attr_index_norm', 'rest_index_norm']

highest = []
lowest = []
for metric in metric_list:
    highest.append(df_.sort_values(metric)['city'].values[-1])
    lowest.append(df_.sort_values(metric)['city'].values[0])
    
pd.DataFrame({'metric':metric_list, 'highest':highest, 'lowest':lowest})
Out[46]:
metric highest lowest
0 price_per_night london lisbon
1 room_shared berlin rome
2 bedrooms amsterdam barcelona
3 person_capacity lisbon amsterdam
4 host_is_superhost athens paris
5 multi rome paris
6 biz lisbon amsterdam
7 guest_satisfaction_overall london amsterdam
8 cleanliness_rating london rome
9 dist berlin athens
10 metro_dist berlin paris
11 attr_index_norm london athens
12 rest_index_norm budapest vienna
In [60]:
fig = go.Figure()

for week_part in ['weekday', 'weekend']:
    fig.add_trace(go.Box(y=df[df['week_part']==week_part]['price_per_night'], 
                         x=df[df['week_part']==week_part]['city'],
                         #boxpoints=False, 
                         name=week_part))# = px.box(df, x='city', y='realSum', color='week_part')

fig.update_layout(
    boxmode='group',
    font_color="black",
    title_font_color="black",
    title='Box plot of listing prices<sup><br>Listing prices on weekdays and weekends are similar, except in Amsterdam where prices are higher on weekends. </sup>', 
    plot_bgcolor='rgba(0, 0, 0, 0)',
    yaxis=dict(title='price per night for two (euros)', 
               range=[0,800], 
               tickformat=',.0f'),
                  

)    
fig.show()

Correlation analysis¶

Here I computed the Pearson correlation coefficients* between all pairs of variables. The following variables were found to be positively correlated:

  • guest satisfaction score & cleanliness rating & being a super host
  • price per night & person capacity & number of bedrooms
  • distance from city centre & distance from metro
  • restaurant and attraction index

These correlations are statistically significant, as shown in the p-value matrix.

Nothing surprising here.

* The Pearson correlation coefficient is the ratio between the covariance of two variables and the product of their standard deviations, so it is essentially a normalised measure of covariance.
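The definition in the footnote can be checked directly against scipy on synthetic data (a minimal sketch; the variables here are made up for illustration):

```python
import numpy as np
from scipy.stats import pearsonr

# toy data with a known linear relationship plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

# Pearson r by definition: sample covariance over the product of sample standard deviations
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
r_scipy, p_value = pearsonr(x, y)

assert np.isclose(r_manual, r_scipy)
```

Note the matching degrees of freedom: `np.cov` defaults to `ddof=1`, so the standard deviations must use `ddof=1` as well.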

In [50]:
vars_of_interest = ['price_per_night','room_shared','bedrooms','person_capacity',
                    'host_is_superhost', 'multi', 'biz',
                    'guest_satisfaction_overall','cleanliness_rating',
                    'dist', 'metro_dist', 'attr_index_norm', 'rest_index_norm']
df_corr = df[vars_of_interest].corr().round(2)

mask = np.triu(np.ones_like(df_corr, dtype=bool))
df_corr_vis = df_corr.mask(mask)

fig = px.imshow(df_corr_vis, text_auto=True, 
               color_continuous_scale=px.colors.diverging.PRGn, 
               color_continuous_midpoint=0)
fig.update_layout(plot_bgcolor='rgba(0, 0, 0, 0)',
                  title='Pearson correlation coefficients',
                  xaxis=dict(tickangle=90), 
                  height=600
                 )

fig.show()
In [51]:
df_pvals = pd.DataFrame(index=vars_of_interest)
for col_metric in vars_of_interest:
    pvals_list = []
    for row_metric in vars_of_interest:
        pvals_list.append(pearsonr(df[col_metric],df[row_metric])[1].round(2))
    df_pvals[col_metric] = pvals_list
In [52]:
df_pvals_vis = df_pvals.mask(mask)

fig = px.imshow(df_pvals_vis, text_auto=True, 
               color_continuous_scale=px.colors.sequential.Purples_r, 
               color_continuous_midpoint=0.1
               )
fig.update_layout(plot_bgcolor='rgba(0, 0, 0, 0)',
                  title='''p-values of the Pearson correlation coefficients
                  <br><sup>(zeros indicate p-values that round to 0.00 at two decimal places)</sup>''',
                  xaxis=dict(tickangle=90), 
                  height=600
                 )

fig.show()

Fitting OLS trendlines for key variables by city¶

Scatterplots are a quick way of displaying the relationship between two variables. To explore whether prices in different cities might be driven by different variables, I created one scatterplot per city and found that:

  • While distance from the centre negatively correlates with price in all cities, the relationship is particularly strong in Amsterdam.
  • Guest satisfaction is correlated with prices in Amsterdam, Athens, Lisbon, Paris, and Rome, but not in other cities.
In [54]:
fig = px.scatter(df, 
                x='dist', 
                y='price_per_night',
                hover_data=['price_per_night', 'dist'],
                color='city',
                facet_col='city', 
                facet_col_wrap=5,
                title='Distance from centre vs price',
                trendline='ols',
                height=600, width=1000
                )

update_annotations_and_layout(fig)
fig.show()
In [55]:
fig = px.scatter(df, 
                x='metro_dist', 
                y='price_per_night',
                hover_data=['price_per_night', 'metro_dist'],
                color='city',
                facet_col='city', 
                facet_col_wrap=5,
                title='Distance from metro vs price',
                trendline='ols',
                height=600, width=1000
                )

update_annotations_and_layout(fig)
fig.show()
In [56]:
fig = px.scatter(df,#[df['person_capacity']==2], 
                x='guest_satisfaction_overall', 
                y='price_per_night',
                hover_data=['price_per_night', 'guest_satisfaction_overall'],
                color='city',
                facet_col='city', 
                facet_col_wrap=5,
                title='Guest satisfaction vs price',
                trendline='ols',
                height=600, width=1000
                )

update_annotations_and_layout(fig)
fig.show()

Predictive modelling for prices¶

Here I experimented with two approaches to predicting listing prices: linear regression and decision tree regression. Cross-validation is used to identify the best model within each approach. Linear regression performed better than the decision tree, suggesting that the relationship between prices and the explanatory variables is more linear than nonlinear. It is also possible that the decision tree overfits the data, capturing noise and irrelevant patterns instead of the true underlying relationship between the variables; decision trees are more prone to overfitting than linear regression models.
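The overfitting point can be illustrated on synthetic data with a genuinely linear signal (a sketch unrelated to the Airbnb dataset): an unconstrained tree memorises the training set but generalises worse than linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# synthetic data: a linear signal plus noise
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(500, 3))
y_syn = X_syn @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=2.0, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

lin = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)  # no depth limit

# the unconstrained tree fits the training set almost perfectly, yet scores worse on the test set
print(f'linear: train R2={lin.score(X_tr, y_tr):.2f}, test R2={lin.score(X_te, y_te):.2f}')
print(f'tree:   train R2={tree.score(X_tr, y_tr):.2f}, test R2={tree.score(X_te, y_te):.2f}')
```

This is why the decision tree section below tunes `max_depth`: limiting depth is the simplest guard against this memorisation.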

Data preprocessing¶

In [21]:
# create dummy variables for cities
city_dummies = pd.get_dummies(df['city'], prefix='city')
# create dummy variables for room type
room_type_dummies = pd.get_dummies(df['room_type'], prefix='room_type')

# merge city and room type dummies with the data
df_all_features = pd.concat([df, city_dummies, room_type_dummies], axis=1)

# transform binary categorical variables into dummy variables
for col in ['room_shared', 'host_is_superhost']:
    df_all_features[col] = 1*df_all_features[col]
    
# select the relevant variables and load into x and y
X = df_all_features[['person_capacity', 'bedrooms', 'host_is_superhost', 'biz', 
                     'cleanliness_rating', 'guest_satisfaction_overall', 
                     'dist', 'metro_dist', 'attr_index_norm', 'rest_index_norm', 'is_weekend', 
                    ]+city_dummies.columns.tolist()+room_type_dummies.columns.tolist()].drop(columns=['city_lisbon', 'room_type_Shared room'])

y = np.array(df_all_features['price_per_night']).astype(float)

Standardize features¶

Here I use StandardScaler() to standardise the features to zero mean and unit standard deviation. This matters because I am going to use recursive feature elimination (RFE) for feature selection: RFE ranks features by the absolute values of their coefficients, so the features must be on the same scale for the ranking to be meaningful.
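What standardisation does, on a tiny made-up matrix with columns on very different scales (illustration only, not the Airbnb features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two toy features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])
X_scaled = StandardScaler().fit_transform(X_demo)

# after scaling, each column has mean 0 and (population) standard deviation 1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```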

In [22]:
from sklearn import preprocessing
In [23]:
standard_scaler = preprocessing.StandardScaler()
X = pd.DataFrame(standard_scaler.fit_transform(X), columns=X.columns)

Split the data into training and test sets¶

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Linear Regression¶

With hyperparameter tuning using grid search cross-validation, evaluated by R-squared (the coefficient of determination).

In the script below I will use grid search to tune the number of features to include in the model. The best parameter is identified by cross validation.

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_selection import RFE
from sklearn.metrics import mean_squared_error, r2_score

Prepare data and create instance of grid search object with cross validation¶

In [26]:
# create an instance of the linear regression model that uses RFE (recursive feature elimination)
lr = LinearRegression()
rfelr = RFE(lr)

# specify the number of features as the hyperparameter to tune
param_grid = [{'n_features_to_select':range(1,len(X.columns))}
               ]
# create an instance of the grid search object, specifying a 5-fold cross validation
grid_search = GridSearchCV(estimator=rfelr, param_grid=param_grid, scoring='r2', cv=5)

Fit the grid search object on the training data¶

In [27]:
# fit the grid search object on the training data
import time
start_time = time.time()
grid_search.fit(X_train, y_train)
print(f'{(time.time() - start_time):.0f} seconds' )
In [28]:
print('Best hyperparameters: ', grid_search.best_params_)

Fit the model on the training set using the best n_features_to_select and test performance¶

In [29]:
# fit the model on the training set using the best hyperparameters
lr_best = RFE(LinearRegression(), **grid_search.best_params_)
lr_best = lr_best.fit(X_train, y_train)

# test performance
y_pred = lr_best.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean squared error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

Perform cross validation to get a more robust estimate of the model performance¶

In [30]:
# perform cross-validation to get a more robust estimate of the model performance

cv_scores = cross_val_score(lr_best, X, y, cv=5, scoring='r2')
print(f'Cross-validation R-squared: {["%.2f" % scr for scr in cv_scores]}' )
print(f'Mean cross-validation R-squared: {cv_scores.mean():.2f}', )
In [31]:
cv_scores = cross_val_score(lr_best, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'Cross-validation neg MSE: {["%.0f" % scr for scr in cv_scores]}')
print(f'Mean cross-validation neg MSE: {cv_scores.mean():.0f}')

Decision tree regressor¶

An alternative to linear regression is decision tree regression. Here I replicate the process above, but instead of the number of features I tune the maximum tree depth. Tree depth is the number of successive splits the tree can make along a path from the root before reaching a prediction.
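Concretely, a tree of depth d makes at most d successive splits along any path, and therefore has at most 2^d leaves. A quick sketch on toy data (not the Airbnb dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression problem
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 1))
y_demo = np.sin(6 * X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

trees = {d: DecisionTreeRegressor(max_depth=d, random_state=0).fit(X_demo, y_demo)
         for d in [1, 2, 3]}
for d, t in trees.items():
    # depth d caps the number of successive splits, hence at most 2**d leaves
    print(f'max_depth={d}: actual depth={t.get_depth()}, leaves={t.get_n_leaves()}')
```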

In [32]:
from sklearn.tree import DecisionTreeRegressor

Create instance of grid search object with cross validation¶

In [33]:
# create an instance of the decision tree regressor
dcr = DecisionTreeRegressor()

# specify maximum depth as the hyperparameter to tune
param_grid = [{'max_depth':range(1,8)}]

# create an instance of grid search object for the decision tree regressor
grid_search_dcr = GridSearchCV(estimator=dcr, scoring='r2', param_grid=param_grid, cv=5)

Fit the grid search object on the training data¶

In [34]:
start_time = time.time()
# fit the grid search object on the training data
grid_search_dcr.fit(X_train, y_train)
print(f'{(time.time() - start_time):.0f} seconds' )
In [35]:
print('Best hyperparameters: ', grid_search_dcr.best_params_)

Fit the model on the training set using the best max_depth and test performance¶

In [36]:
# fit the model on the training set using the best hyperparameters
dcr_best = DecisionTreeRegressor(**grid_search_dcr.best_params_)
dcr_best.fit(X_train, y_train)

# test performance
y_pred_dcr = dcr_best.predict(X_test)
mse = mean_squared_error(y_test, y_pred_dcr)
r2 = r2_score(y_test, y_pred_dcr)

print(f'Mean squared error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

Perform cross validation to get a more robust estimate of the model performance¶

In [37]:
# perform cross-validation to get a more robust estimate of the model performance

cv_scores = cross_val_score(dcr_best, X, y, cv=5, scoring='r2')
print(f'Cross-validation R-squared: {["%.2f" % scr for scr in cv_scores]}' )
print(f'Mean cross-validation R-squared: {cv_scores.mean():.2f}', )
In [38]:
cv_scores = cross_val_score(dcr_best, X, y, cv=5, scoring='neg_mean_squared_error')
print(f'Cross-validation neg MSE: {["%.0f" % scr for scr in cv_scores]}')
print(f'Mean cross-validation neg MSE: {cv_scores.mean():.0f}')

Statistical inference¶

RFE selected 17 of the 22 input features for the linear regression model, and all 17 were statistically significant (p-value < 0.05). Here are some interesting observations:

  • Restaurant index was eliminated while attraction index was kept. This suggests that when deciding where to stay, Airbnb guests value proximity to attractions more than proximity to restaurants.
  • Distance from centre was eliminated while distance from metro remained in the model. Recall from the exploratory data analysis that both were negatively correlated with listing price; it turns out that metro distance matters more to Airbnb guests.
  • The weekend and Vienna dummy variables were also dropped, suggesting that listing prices do not depend on the day of the week, and that listing prices in Vienna are similar to those in the 'default' city, Lisbon, which was omitted to avoid multicollinearity.
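The baseline-city point works like this: with k categories, only k-1 dummy columns are needed, and the omitted category is absorbed into the intercept (a minimal sketch with three made-up rows):

```python
import pandas as pd

# with k categories, keep k-1 dummies; the dropped category becomes the baseline
cities = pd.Series(['lisbon', 'paris', 'vienna', 'lisbon'])
dummies = pd.get_dummies(cities, prefix='city').drop(columns=['city_lisbon'])
print(dummies)  # a lisbon row is all zeros: its effect sits in the intercept
```

Keeping all k dummies alongside an intercept would make the columns sum to a constant, which is exactly the multicollinearity (the "dummy variable trap") that dropping one column avoids.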


The decision tree regressor assigned a feature importance of zero to all but 6 features.

Linear regression feature coefficients¶

In [39]:
pd.DataFrame({'feature_name':lr_best.get_feature_names_out(), 'feature_coef': lr_best.estimator_.coef_}).sort_values('feature_coef', ascending=False)

Eliminated features¶

In [40]:
[feat for feat in lr_best.feature_names_in_ if feat not in lr_best.get_feature_names_out()]

Using the statsmodels library to get the p-values of the feature coefficients

x = df_all_features[lr_best.get_feature_names_out().tolist()]
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
model.summary()

Decision tree regressor feature importances¶

In [41]:
pd.DataFrame({'feature_name':dcr_best.feature_names_in_, 'feature_importance': dcr_best.feature_importances_}).sort_values('feature_importance', ascending=False)